
Introduction

 

The goal of this assignment is to explore text mining and information retrieval using the web as a source. Our main source will be the IMDb website.
We will use it to retrieve information about movies and TV shows, directors, casts and user reviews using R tools, mainly the rvest package. After retrieving the desired information we will use text mining tools in R (the tm package) to manage, organize and transform it. Finally we will use predictive and analysis tools in the hope of better understanding the rules underlying concepts such as a review score or a movie genre.
 
 
 

Information Retrieval

 
 

Find basic movie information based on a query string

 

The first feature consists of retrieving basic information about movies based on a user-chosen string. Since a string can correspond to numerous different titles, we return a list.

We agreed to use the title ID as the way of representing a title.
The first function takes the user query string and returns a list of title IDs related to the query. It also has an optional argument max that limits the number of titles retrieved.

searchTitle <- function(query, max=200){
  # Requires the rvest and stringr packages; "&s=tt" restricts the search to titles.
  query <- URLencode(query)
  resultPage <- read_html(str_interp("http://www.imdb.com/find?q=${query}&s=tt"))
  filmList <- html_nodes(resultPage, ".findList")
  filmList <- html_children(filmList)
  listCount <- length(filmList)
  if(!listCount){ # no results found
    return(NULL)
  }
  if(max < listCount){
    listCount <- max
  }
  movieList <- c()
  for(i in 1:listCount){
    # the second cell of each result row holds the anchor whose href
    # contains the title ID ("/title/tt1234567/...")
    movieAnchor <- html_children(html_children(filmList[i])[2])[1]
    movie <- str_split(movieAnchor, "/")[[1]][3]
    movieList <- c(movieList, movie)
  }
  return(movieList)
}
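As a quick usage sketch (the actual IDs returned depend on IMDb's live search results, so the output is not shown):

```r
# Hypothetical call: returns a character vector of title IDs such as "tt0266697".
killBillIds <- searchTitle("kill bill", max = 5)
```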

Once we have a title ID, we can use the IMDb title page to retrieve its details. We split the retrieval into isolated functions so it is easy to add or remove the details we actually want. Each function receives the title ID and, optionally, an already-downloaded page; by passing the page we only need to request it once, even when we need several details.

getTitle <- function(movieID, page=NULL){
  if(is.null(page)){
    page <- read_html(str_interp("http://www.imdb.com/title/${movieID}"))
  }
  return(html_text(html_nodes(page, "#title-overview-widget h1")))
}

getYear <- function(movieID, page=NULL){
  if(is.null(page)){
    page <- read_html(str_interp("http://www.imdb.com/title/${movieID}"))
  }
  return(page %>% html_nodes("#titleYear a") %>% html_text() %>% as.numeric())
}

getCast <- function(movieID, page=NULL){
  if(is.null(page)){
    page <- read_html(str_interp("http://www.imdb.com/title/${movieID}"))
  }
  return(page %>% html_nodes("#titleCast .itemprop span") %>% html_text())
}


getDirector <- function(movieID, page=NULL){
  if(is.null(page)){
    page <- read_html(str_interp("http://www.imdb.com/title/${movieID}"))
  }
  return(html_nodes(page, ".credit_summary_item .itemprop")[1] %>% html_text())
}

We aggregate all the details in the getDetails function. This function receives the title ID and returns a named vector with the various details.

getDetails <- function(movieID){
  link <- str_interp("http://www.imdb.com/title/${movieID}")
  page <- read_html(link)
  description <- c(link=link, title=getTitle(movieID,page), year=getYear(movieID,page),
                   director=getDirector(movieID,page), cast=getCast(movieID,page))
  return(description)
}
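A hedged usage example (the field values depend on the live IMDb page; "tt0266697" is used here as an example title ID):

```r
# Returns a named vector: link, title, year, director, and one entry per cast member.
details <- getDetails("tt0266697")
details["title"]
```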

 

Collect reviews of a movie

 

We did not find an IMDb URL that shows all the reviews of a movie, but it is possible to construct a URL that retrieves a chosen number of reviews. One way would be to first find how many reviews the movie has and then query that many.
However, we found a hyperlink in the reviews page that lists all the reviews (though not with the full information), and by modifying that URL we can get the full reviews. The resulting page contains the text and score of every review, so we can extract both. We made a function that receives a title ID and returns a list of reviews (text and score). We also prepend the title of each review to its text, since it can carry useful information.

extractScore <- function(node) {
  # the score is rendered as an image whose alt text looks like "7/10"
  scoreImg <- html_node(node, "h2+ img")
  if(!is.na(scoreImg)){
    score <- html_attr(scoreImg, "alt")
    score <- as.integer(str_split(score, "/")[[1]][1])
  }
  else{
    score <- NA
  }
  return(score)
}

getReviews <- function(movie_id, count = 0){
  indexReviewsPage <- read_html(str_interp("http://www.imdb.com/title/${movie_id}/reviews-index?"))
  showAllPartialUrl <- html_nodes(indexReviewsPage, "table+ table a+ a") %>% html_attr("href")
  if(count > 0){
    showAllPartialUrl <- gsub("count=(\\d)+", str_interp("count=${count}"), showAllPartialUrl)
  }
  showAllReviewsPartialUrl <- gsub("-index", "", showAllPartialUrl)
  showAllReviewsUrl <- str_interp("http://www.imdb.com/title/${movie_id}/${showAllReviewsPartialUrl}")
  listReviewsPage <- read_html(showAllReviewsUrl)
  # reviews alternate: a div with the header (title, score) followed by a p with the text
  reviewsNodeList <- html_nodes(listReviewsPage, "#tn15content div+ p , hr+ div")
  reviews <- list()
  for(i in seq(1, length(reviewsNodeList), 2)){
    score <- extractScore(reviewsNodeList[i])
    title <- html_text(html_node(reviewsNodeList[i], "h2"))
    text <- html_text(reviewsNodeList[i+1])
    text <- paste(title, text, "\n")
    reviews$scores <- c(reviews$scores, score)
    reviews$text <- c(reviews$text, text)
  }
  return(reviews)
}

We noticed that some of the reviews didn’t have a score, so we made a function to remove those.

removeNaScores <- function(X){
  X$text <- X$text[!is.na(X$scores)]
  X$scores <- X$scores[!is.na(X$scores)]
  return(X)
}
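For example, applying removeNaScores to a small hand-made reviews list (pure R, no scraping involved):

```r
reviews <- list(scores = c(8, NA, 3),
                text   = c("Great film.", "No score given.", "Weak plot."))
clean <- removeNaScores(reviews)
clean$scores  # 8 3
clean$text    # "Great film." "Weak plot."
```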

 
 

Text Mining

 

Once we have a list of reviews we can form a VCorpus, using VectorSource(movieReviewsList$text) as the input.

VCorpus(VectorSource(movieReviewsList$text))

 

Text Transformations

 

We apply various text transformations to obtain a more meaningful set of documents.

We collapse extra whitespace, such as line breaks and tabs.

tm_map(reviews, stripWhitespace)

We transform all words to lower case.

tm_map(reviews, content_transformer(tolower))

We remove the English stop words (since we assume all reviews are in English), eliminating the most common words.

tm_map(reviews, removeWords, stopwords("english"))

We stem the documents, reducing each word to its root.

tm_map(reviews, stemDocument)

After creating the document-term matrix we had around 13k terms for the reviews of “Kill Bill”.
However, on inspecting the matrix we found many terms with punctuation or symbols at the beginning or the end, so we created a transformation to remove those.

f <- content_transformer(function(x, pattern, sub) gsub(pattern, sub, x))
tm_map(reviews, f, "\\W|\\d|_", " ")

Applying this transformation before the removal of the stop words, we obtain a document-term matrix with around 9k terms.
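For reference, the document-term matrix discussed above is built with tm's DocumentTermMatrix; `reviews` here is the transformed corpus:

```r
dtm <- DocumentTermMatrix(reviews)
dim(dtm)                 # number of documents x number of terms
inspect(dtm[1:5, 1:5])   # peek at a corner of the matrix
```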

Wordcloud

We formed a wordcloud with all the reviews for each score.
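A minimal sketch of how such a wordcloud can be produced with the wordcloud package, assuming `dtm` is the document-term matrix restricted to the reviews with a given score:

```r
library(wordcloud)  # also attaches RColorBrewer

# Sum term counts across documents to get per-term frequencies.
freq <- sort(colSums(as.matrix(dtm)), decreasing = TRUE)
wordcloud(names(freq), freq, max.words = 100, random.order = FALSE,
          colors = brewer.pal(8, "Dark2"))
```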